<?php
/**
 * <https://y.st./>
 * Copyright © 2015 Alex Yst <mailto:copyright@y.st>
 * 
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <https://www.gnu.org./licenses/>.
**/

$xhtml = array(
	'<{title}>' => 'Fixing my spider',
	'<{body}>' => <<<END
<p>
	This morning, the code at <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">Stack Overflow</a> looked much more manageable than it had last night.
	I built a new class based on it that takes a size in bytes during instantiation, then uses that size as a reference when the object is called as a function.
	The example used a closure instead, but I think having a single line such as <code>CURLOPT_PROGRESSFUNCTION =&gt; new curl_limit(1024*1024),</code> is much more readable; a rough sketch of the class appears below.
	I opted not to set <code>CURLOPT_BUFFERSIZE</code>, as I do not know what that is supposed to do.
	From what the question/answer page was saying, I think that it somehow makes $a[cURL] progress reports come more frequently, but I do not need that.
	I do not need an exact cut-off point, just a way to keep the download from going completely wild.
	I set the download limit to a full megabyte, hoping that it would be high enough to allow all regular Web pages to come through, and so far, that seems to be the case.
	No downloads were aborted aside from that singular problem file.
	After getting past that though, the spider quickly ran out of pages to crawl.
	It seems that this website does not link to any onion-based websites that link to many others.
	I will try linking to <a href="http://skunksworkedp2cg.onion/">Harry71&apos;s Onion Spider robot</a> to improve the results I get.
	I fear that this much input could jam something up on my end though.
	My own spider is not at all optimized and it keeps all its known onion addresses in memory at once.
	This entry will be put up prematurely so that I can continue my experiments.
</p>
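<p>
	For anyone wanting to do the same, the heart of such a class is nothing more than a constructor and an <code>__invoke()</code> method; this is only a rough sketch (the names here are illustrative, not necessarily what I committed):
</p>
<pre><code>class curl_limit {
	private \$limit;
	public function __construct(\$limit) {
		\$this-&gt;limit = \$limit;
	}
	// cURL calls this periodically throughout the transfer; returning
	// a non-zero value tells cURL to abort the download.
	public function __invoke(\$curl, \$download_size, \$downloaded, \$upload_size, \$uploaded) {
		return \$downloaded &gt; \$this-&gt;limit ? 1 : 0;
	}
}</code></pre>
<p>
	The one catch is that <code>CURLOPT_NOPROGRESS</code> must be set to <code>false</code>, as cURL never calls the progress function otherwise.
</p>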
<p>
	My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
<p>
	After some more testing, I realized that I needed a couple new features.
	The first breaks unneeded order-based relationships between $a[URI]s by sorting the database before saving it.
	The second is more necessary for basic functionality though.
	I found that the spider was repeatedly requesting the same page on Harry71&apos;s website using different $a[URI] fragments.
	We do not need to request the same page several times, nor do we want $a[URI] fragments to be in our database; both fixes are sketched below.
	A quick search of the existing database showed that there were only two fragments already in the database from the first successful run.
	One was a legitimate anchor that I embedded in a page, but the other was an error in my weblog.
	It confused me at first that the bulk of my anchors were not being found by the spider, but I soon realized that it is because those anchors are on pages accessible over the clearnet.
</p>
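<p>
	Assuming the database is just a $a[PHP] array keyed by $a[URI] (mine is at least a serialized array), both fixes come down to a few lines; the variable names here are illustrative:
</p>
<pre><code>// Drop the fragment before a URI is queued or stored; fragments are
// only meaningful client-side, so they never name a different page.
\$uri = preg_replace('/#.*\$/', '', \$uri);

// Sort by key before serializing so that crawl order cannot leave
// meaningless orderings behind in the saved database.
ksort(\$database);
file_put_contents(\$database_file, serialize(\$database));</code></pre>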
<p>
	With these two new features in place, I ran the spider once more, starting over from the single-entry database that I had begun with.
	Because the first-run database was only created today, starting fresh seemed easier than trying to hand-edit the two fragment-bearing entries out of it.
	The database is currently stored as a serialized array, and the last time that I tried editing one of those by hand, I kept breaking it.
	After the spider ran a while, I noticed that it was converting some relative $a[URI]s to incorrect absolute $a[URI]s.
	It seems that I need to account for the special case of files being present in the website&apos;s document root while also linking to relative $a[URI]s.
	I took care of that error, and while I was at it, added a configurable user agent string.
	On the next run, quite a ways in, I found that the <code>&lt;base/&gt;</code> tag was not being properly handled, and once again, I had to restart the spider; the corrected resolution logic is sketched below.
</p>
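<p>
	The fix boils down to making sure that cutting the base path back to its containing directory always leaves at least <code>/</code>. Below is a simplified sketch of the resolution logic, not my exact code: it ignores query strings and protocol-relative references, and the <code>\$base</code> passed in should be the page&apos;s <code>&lt;base/&gt;</code> value when one is present, or the page&apos;s own $a[URI] otherwise.
</p>
<pre><code>// Resolve a possibly-relative reference against a base URI.
function resolve(\$base, \$relative) {
	// Already-absolute references pass through untouched.
	if (parse_url(\$relative, PHP_URL_SCHEME) !== null) {
		return \$relative;
	}
	\$parts = parse_url(\$base);
	\$origin = \$parts['scheme'].'://'.\$parts['host'];
	// Root-relative references hang directly off the origin.
	if (\$relative !== '' &amp;&amp; \$relative[0] === '/') {
		return \$origin.\$relative;
	}
	\$path = isset(\$parts['path']) ? \$parts['path'] : '/';
	// Cut back to the containing directory. For a file sitting in the
	// document root, this leaves "/" rather than an empty string,
	// which is the special case I had to account for.
	\$dir = substr(\$path, 0, strrpos(\$path, '/') + 1);
	return \$origin.\$dir.\$relative;
}</code></pre>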
<p>
	While I was speaking with <a href="http://zdasgqu3geo7i7yj.onion/">theunknownman</a> on <a href="ircs://volatile.ch:6697/">Volatile</a>, he asked how my website was put together, so I separated the personal material that I do not want in a clearnet repository into a repository of its own, bound the two repositories together with symbolic links, and uploaded the <a href="https://notabug.org/y.st./authorednansyxlu.onion.">main compile scripts and templates</a>.
</p>
<p>
	On a more serious note, <a href="https://wowana.me/">wowaname</a> and lucy are hassling theunknownman on <a href="ircs://irc.volatile.ch:6697/%23Volatile">#Volatile</a>.
	Theunknownman had some sort of technical issue with his $a[VPN], and instead of the failure preventing an $a[IRC] connection from being established, his machine connected to the network over the clearnet.
	Now they are flaunting the fact that they have his home $a[IP] address, much to his terror.
	He thinks that they will actually do something to him now that they know where he is.
	I do not think that theunknownman is in any real danger, but this shows just how much of a troll wowaname can be.
	She has most of the channel hassling theunknownman simply because she can.
</p>
<p>
	I am hanging out with a band of trolls.
	I need to find better company to keep.
	It is difficult though when most places maliciously discriminate against $a[Tor] users.
	It seems that trolls are pushed into the few places that allow $a[Tor] use, as they use $a[Tor] to evade bans.
	Those of us who do not evade bans, and in fact have done nothing to get banned, are blocked as collateral damage.
</p>
<p>
	Yesterday, the letter saying that mail bearing my surname would be forwarded to our new address finally came.
	Today, we actually received our forwarded mail too, complete with spam.
	It seems that the mail forwarding has been set up successfully.
</p>
<p>
	My end-of-day progress with the spider did not go as planned.
	It got stuck on <code>costeirahx33fpqu.onion</code> for some reason.
	I waited several hours, but it would not budge.
	I will need to tinker with it more tomorrow to try getting past that issue.
	I might set a time-based timeout to take care of it, as I do not think that this was a large-file issue like last time; the needed options are sketched below.
</p>
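<p>
	If I do end up going that route, it should only take one or two more entries in the existing cURL options array, something like the following; the exact values are placeholders that would still need tuning.
</p>
<pre><code>curl_setopt_array(\$curl, array(
	// Abort the whole transfer if it takes longer than a minute,
	// regardless of how much progress is being made.
	CURLOPT_TIMEOUT =&gt; 60,
	// Separately, give up if a connection cannot even be established
	// within twenty seconds.
	CURLOPT_CONNECTTIMEOUT =&gt; 20,
));</code></pre>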
<p>
	I learned something interesting from synapt of <a href="ircs://irc.oftc.net:6697/%23php">#php</a> today.
	Apparently, the architects behind $a[PHP] were not even halfway done writing $a[PHP]6 and people were already writing documentation and even books about how to code in it.
	This documentation and these books were obviously inaccurate, as there was no way to know yet how $a[PHP]6 would turn out, so $a[PHP]6 was canceled altogether to avoid the confusion that these people had caused.
	Now, the developers are instead working on $a[PHP]7, giving it the features that $a[PHP]6 was going to have.
</p>
END
);
